Conversation
Provisioning used to silently wait out the full 6000s timeout on any guest-side failure because the cleanup trap only logged the error. Now it writes STACK_CLOUD_INIT_FAILED and shuts the VM down, and the host waiter breaks on that marker and reports it distinctly. Also bump smoke test timeout 120s->300s, dump docker ps / container logs / free -m / verbose curl when it fails, log the qemu accel path, and enable /dev/kvm on the CI runner so the VM isn't stuck in TCG.
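A minimal sketch of the two halves, assuming illustrative names everywhere except the STACK_CLOUD_INIT_* markers:

```bash
# Guest side (provision-build): on any failure, write a distinct marker to
# the console and power off, instead of only logging the error.
on_failure() {
  echo "STACK_CLOUD_INIT_FAILED" > /dev/console  # host waiter greps for this
  shutdown -P now
}
trap on_failure ERR

# Host side (build-image.sh): break on either marker in the serial log
# rather than sleeping out the full 6000s provision timeout.
wait_for_guest() {  # illustrative name, not the actual script's
  local serial_log="$1" timeout="${2:-6000}" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    grep -q "STACK_CLOUD_INIT_DONE" "$serial_log" 2>/dev/null && return 0
    if grep -q "STACK_CLOUD_INIT_FAILED" "$serial_log" 2>/dev/null; then
      echo "guest provisioning failed; see $serial_log" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "provisioning timed out after ${timeout}s" >&2
  return 1
}
```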
The arm64 matrix entry cross-compiles on the amd64 CI runner, so the guest runs under QEMU TCG. Under -cpu max, V8 emits armv8.5+ JIT code that TCG mistranslates and node crashes with SIGTRAP (exit 133) during migrations. Three changes together get it working:
- Drop to -cpu cortex-a72 for TCG arm64 guests. This limits V8 to armv8.0-a, which TCG handles cleanly. Native paths (HVF/KVM) keep -cpu max for full performance.
- Run migrations with NODE_OPTIONS=--jitless as belt-and-suspenders. Migrations are I/O-bound, so the perf hit is negligible.
- Skip the in-guest smoke test on arm64. A full Next.js backend under cross-arch TCG either SIGTRAPs or times out; the amd64 build still runs the smoke test, which covers every non-arch-specific code path.

Arch is propagated into the guest via a new build-arch.env marker in the stack-bundle ISO (consumed roughly as sketched below).
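A rough sketch of the guest-side consumption; the ISO mount point and the backend port are illustrative, and the /health?db=1 path matches the smoke test referenced later in this thread:

```bash
# Read the arch marker shipped in the stack-bundle ISO.
# build-arch.env sets BUILD_ARCH=amd64|arm64 from the CI matrix entry.
. /mnt/stack-bundle/build-arch.env

if [ "${BUILD_ARCH:-}" = "arm64" ]; then
  # Cross-arch TCG: a full Next.js backend SIGTRAPs or times out here;
  # arch-independent paths are covered by the amd64 job's smoke test.
  echo "STACK_PROVISION: skipping smoke test on arm64"
else
  curl -fsS --max-time 300 "http://localhost:3000/health?db=1"
fi
```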
The previous commit set NODE_OPTIONS=--jitless on the migration docker exec. That was wrong for two reasons:
- --jitless disables eval and new Function, which some code in the migration path uses, so it broke amd64 builds that had been passing.
- --jitless is a V8 feature gate, not a TCG workaround. If it breaks one arch it breaks both; it could never have helped arm64 either.

Revert the --jitless flag and rely on -cpu cortex-a72 (added in the parent commit) as the root-cause fix for the arm64 TCG SIGTRAP. Keep the stdout/stderr capture for the migration exec so the next failure dumps the actual node error through log-provision instead of being swallowed by the serial-only stream.
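The kept capture could look roughly like this; migrate_log and the stack-build-init container name appear elsewhere in this PR, the rest is illustrative:

```bash
# Migration exec with both streams captured; on failure, replay the log
# through log-provision so the node error is visible off the serial stream.
migrate_log=/var/log/stack-migrate.log
if ! docker exec stack-build-init \
    sh -c 'node dist/db-migrations.mjs migrate' >"$migrate_log" 2>&1; then
  while IFS= read -r line; do
    log-provision "migrate: $line"
  done < "$migrate_log"
  exit 1
fi
```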
Cross-arch TCG on ubicloud-standard-8 either SIGTRAPs during migrations (old QEMU) or hangs in wait-for-deps with no progress. GitHub's ubuntu-24.04-arm runner is an Azure arm64 VM — same-arch TCG, no KVM (no nested virt exposed) — but empirically completes migrations, the dep setup, and image packaging end-to-end (verified on the diagnostics branch run). Only failure there was the backend smoke test hitting its 300s timeout, which the parent commit on this branch already skips on arm64. Keep amd64 on ubicloud-standard-8 for its KVM acceleration.
wait-for-deps used to loop forever on each service, so any single dep that failed to start (e.g. a service crash-looping under TCG) hung the build until the outer 6000s provision timeout. Rewrite as a wait_for helper (roughly as sketched below) with:
- A hard 1500s budget across the full dep wait (overridable via STACK_DEPS_TIMEOUT). On timeout, dump docker ps -a, the last 300 lines of the deps container, and per-service reachability, then exit 1 so provision-build's cleanup trap fires and the VM shuts down fast.
- "<service> ready (Ns)" log lines for each service, so successful runs show which service was the bottleneck.
- A 30s heartbeat per service, so long-running waits don't look frozen.

amd64 is unaffected: services come up in ~1s each under KVM, well inside any threshold here.
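A condensed sketch of that shape (wait_for, STACK_DEPS_TIMEOUT, and the 1500s/300-line/30s numbers are from this commit; probe commands and the deps container name are as they appear elsewhere in this thread):

```bash
DEADLINE=$(( $(date +%s) + ${STACK_DEPS_TIMEOUT:-1500} ))  # shared budget

wait_for() {  # wait_for <service> <probe command...>
  local name="$1"; shift
  local start now next_beat
  start=$(date +%s); next_beat=$((start + 30))
  until "$@" >/dev/null 2>&1; do
    now=$(date +%s)
    if [ "$now" -ge "$DEADLINE" ]; then
      echo "STACK_PROVISION: $name timed out; dumping diagnostics"
      docker ps -a
      docker logs --tail 300 stack-build-init 2>&1 || true
      # (per-service reachability dump elided in this sketch)
      exit 1  # provision-build's cleanup trap then shuts the VM down fast
    fi
    if [ "$now" -ge "$next_beat" ]; then  # 30s heartbeat
      echo "STACK_PROVISION: still waiting for $name ($((now - start))s)"
      next_beat=$((next_beat + 30))
    fi
    sleep 1
  done
  echo "STACK_PROVISION: $name ready ($(( $(date +%s) - start ))s)"
}

wait_for postgres pg_isready -h 127.0.0.1
wait_for redis    redis-cli -h 127.0.0.1 ping
```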
Same-arch TCG (e.g. arm64 guest on the arm64 ubuntu-24.04-arm runner that has no nested virt) was falling through to -cpu cortex-a72 too. Empirically that hangs wait-for-deps indefinitely — services never reach a ready state — probably because QEMU's TCG emulation of named CPU models is less well-tested than -cpu max, especially for the LSE atomic fallback paths the dep services exercise. The cortex-a72 workaround is only needed for cross-arch TCG, where V8 emits JIT instructions the amd64 host's TCG mistranslates. Restrict it to that case; same-arch TCG now gets -cpu max, matching the known working config from the diagnostics branch run on ubuntu-24.04-arm.
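The combined decision, as of this commit, reduces to roughly this sketch (a later commit in this thread bumps the cross-arch model to cortex-a76; the function name and arg plumbing are illustrative):

```bash
# Only the model choices come from this thread; normalize aarch64/arm64
# spellings before comparing host and guest arch.
pick_cpu() {
  local host_arch="$1" guest_arch="$2" accel="$3"
  if [ "$accel" != "tcg" ]; then
    echo "max"          # KVM/HVF: expose the full native feature set
  elif [ "$host_arch" = "$guest_arch" ]; then
    echo "max"          # same-arch TCG: known-good on ubuntu-24.04-arm
  else
    echo "cortex-a72"   # cross-arch TCG: cap V8 at armv8.0-a
  fi
}

cpu=$(pick_cpu "$(uname -m)" "$GUEST_ARCH" "$ACCEL")
```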
… V8 --jitless

Flip arm64 matrix back to ubicloud-standard-8 so both arches share one runner fleet. Cross-arch TCG on an amd64 host previously SIGTRAP'd in migrations because V8's JIT emitted arm64 instructions that QEMU's cross-arch translator mis-handled; pair the existing -cpu cortex-a72 fallback with NODE_OPTIONS=--jitless on the migration docker exec to force V8 to stay on the interpreter. Does not affect amd64 migrations (KVM, no TCG).
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Adds multi-arch QEMU CI, KVM enablement, UPX compression for emulator service binaries, enhanced QEMU build/run provisioning and logging, refactors cloud-init into a centralized provision-build flow with robust dependency waiting, and adds a headless per-arch QEMU boot test script.
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Host as Host CI / Builder
participant Storage as Host files / qcow2
participant QEMU as QEMU VM (guest)
participant Guest as Guest init (/usr/local/bin/provision-build)
participant Deps as Deps container (inside guest)
Host->>Storage: prepare qcow2, seed ISO, build.env
Host->>QEMU: launch VM (arch-specific CPU, virtfs mount)
QEMU->>Guest: boot → cloud-init invokes provision-build
Guest->>Deps: start deps container (run-build-migrations)
Deps->>Guest: service probes report readiness
Guest->>Guest: write provision.log and stream to serial
Guest->>Host: write STACK_CLOUD_INIT_DONE/FAILED to hostfs/serial
Host->>Storage: collect serial.log & provision.log (on success/failure)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Plain --jitless also disables V8's WebAssembly support, which breaks Prisma 7's wasm query compiler on import (ReferenceError: WebAssembly is not defined in decodeBase64AsWasm). --no-opt only disables the TurboFan optimizer — the tier responsible for the aggressive arm64 instructions (PAC/BTI/LSE) that cross-arch TCG mistranslates — while leaving the Sparkplug baseline and the Wasm JIT intact, so Prisma's wasm compiler runs at full speed.
Node's NODE_OPTIONS allowlist rejects --no-opt (unlike --jitless, which it happens to permit). Put the flag directly on the node command line inside the docker exec so V8 actually picks it up.
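Concretely, a sketch of the placement, using the container name and script path that appear in the review comments below:

```bash
# NODE_OPTIONS refuses --no-opt outright, so pass it as a real argv flag
# inside the exec'd command line, where V8 accepts it.
docker exec stack-build-init \
  sh -c 'node --no-opt dist/db-migrations.mjs migrate'
```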
The clickhouse binary since 22.x is a small ELF loader with a ZSTD-compressed payload appended after the section table. strip rewrites the ELF and can invalidate the loader's lookup of its own trailing payload, causing it to decompress garbage and spin forever — the exact symptom on cross-arch TCG runs where clickhouse-server produced zero log output while postgres/redis/svix/minio/qstash (none of them self-extracting) all started fine under identical settings. Stripping was a no-op for size anyway; the payload bytes live outside any section and strip can't touch them.
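In the image's compression stage this amounts to an explicit skip; a sketch with illustrative paths (the real Dockerfile's UPX stage is described in the review summary below):

```bash
# Compress service binaries, but leave clickhouse alone: it is a small
# ELF loader with a ZSTD payload appended past the section table, and
# rewriting the ELF headers breaks its lookup of that trailing payload.
for bin in /opt/emulator/bin/*; do
  case "$(basename "$bin")" in
    clickhouse*)
      echo "skip self-extracting binary: $bin" ;;
    *)
      strip --strip-unneeded "$bin" 2>/dev/null || true
      upx --best --lzma "$bin" || true ;;
  esac
done
```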
With strip no longer corrupting the ClickHouse self-extractor, clickhouse-server now reaches first-instruction execution and immediately SIGILLs in a supervisord crash loop. Root cause: its statically-linked LSE atomics (armv8.1) are rejected under -cpu cortex-a72 (armv8.0). cortex-a76 is armv8.2-a: LSE available, but no PAC (v8.3) and no BTI (v8.5), so V8's aggressive JIT tiers still don't see the feature flags that tripped cross-arch TCG's translator on the old -cpu max runs. Combined with `node --no-opt` on migrations (Ignition+Sparkplug only, no TurboFan/Maglev), this is the narrow CPU profile that should let both V8 and ClickHouse coexist under cross-arch TCG.
Two independent reasons this can't work under cross-arch TCG on the
ubicloud amd64 runner:
1. The backend at runtime runs without --no-opt (we only apply the
flag to the one-shot migration exec). That means TurboFan is live
and will re-emit the aggressive arm64 JIT code the original
-cpu max runs SIGTRAP'd on. Baking --no-opt into the runtime
entrypoint would ship in the image and permanently degrade real
arm64 users (who have KVM and don't need it).
2. Even if we fixed (1), next start under cross-arch TCG is too slow
to come up within any reasonable timeout.
amd64 verify under KVM already exercises the image's service stack;
the arm64 artifact is built from the same Dockerfile and trusted to
run on real arm64 hardware where KVM is available.
After fixing the JS-side SIGTRAP with --no-opt and getting ClickHouse
happy on cortex-a76, migrations finally ran on cross-arch TCG and
immediately hit a V8 internal assertion in Runtime_WasmTriggerTierUp:
Check failed: it->second.Size() > offset.
Heap::GcSafeFindCodeForInnerPointer
InnerPointerToCodeCache::GetCacheEntry
StackFrameIterator::Advance
Runtime_WasmTriggerTierUp
Same class of bug as the JS SIGTRAP, just in the wasm pipeline: once
Prisma 7's wasm query compiler gets hot, V8 walks the stack to promote
from Liftoff to TurboFan, and cross-arch TCG's translated inner
pointers don't line up with V8's code-cache entries.
--no-opt only affects the JS tiers; wasm has its own. Pin wasm to
Liftoff with --no-wasm-tier-up. Liftoff is still JITed (unlike
--wasm-jitless, which would force the interpreter and tank migration
time), so Prisma speed stays reasonable.
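The resulting migration exec, matching the command quoted in the review comments below:

```bash
# --no-opt: keep V8's JS tiers at Ignition+Sparkplug (no TurboFan/Maglev).
# --no-wasm-tier-up: keep wasm on Liftoff, so Runtime_WasmTriggerTierUp's
# stack walk never runs under cross-arch TCG.
docker exec stack-build-init \
  sh -c 'node --no-opt --no-wasm-tier-up dist/db-migrations.mjs migrate'
```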
Greptile Summary
This PR improves the local emulator build by: (1) consolidating both amd64 and arm64 CI jobs onto a single […]

Confidence Score: 5/5
Safe to merge — all findings are P2 style/robustness suggestions that do not affect correctness or the primary build path. The cross-arch TCG workarounds (cortex-a76, --no-opt, --no-wasm-tier-up) are well-reasoned and thoroughly commented. The CI workflow correctly gates smoke tests to amd64. The Dockerfile UPX stage carefully skips ClickHouse. Remaining issues are limited to the test-serial.sh utility (macOS-only accelerator, duplicate log printing) and a missing container cleanup in slim-docker-image — none of these block the production build path.
Files of note: docker/local-emulator/qemu/test-serial.sh (hvf hardcoded, Linux incompatible) and docker/local-emulator/qemu/cloud-init/emulator/user-data (flatten container not cleaned on failure).
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[CI: ubicloud-standard-8 runner] --> B{matrix.arch}
B -->|amd64| C[KVM acceleration]
B -->|arm64| D[cross-arch TCG\ncortex-a76 CPU]
C --> E[build-image.sh]
D --> E
E --> F[Boot QEMU VM with cloud-init]
F --> G[install-emulator-containers]
G --> H[run-build-migrations\nnode --no-opt --no-wasm-tier-up]
H --> I[slim-docker-image]
I --> J{BUILD_ARCH == arm64?}
J -->|yes| K[Skip smoke test]
J -->|no| L[Smoke test curl /health?db=1]
K --> M[docker export/import flatten]
L --> M
M --> N[Nuke + rebuild Docker storage]
N --> O[Zero free space + fstrim]
O --> P[shutdown -P now]
P --> Q[Host: qemu-img convert qcow2]
Q --> R{matrix.arch == amd64?}
R -->|yes| S[Start + Verify + Stop emulator]
R -->|no| T[Skip runtime verify]
```
Actionable comments posted: 1
🧹 Nitpick comments (2)
docker/local-emulator/qemu/build-image.sh (1)
277-280: Verify BUILD_ENV_FILE is set before use.
BUILD_ENV_FILE is used on line 277 (cp "$BUILD_ENV_FILE" ...) but is only set later, on line 393. Since build_one is called from the loop starting at line 395, which comes after line 393, this should work correctly in practice. However, if someone were to call build_one directly or reorder the script, this would fail. Consider moving the env generation to the top of the script or passing BUILD_ENV_FILE as a parameter to build_one.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/local-emulator/qemu/build-image.sh` around lines 277 - 280, The script copies BUILD_ENV_FILE inside build_one but BUILD_ENV_FILE is only set later, so ensure build_one always has a valid BUILD_ENV_FILE: either (a) move the generation of BUILD_ENV_FILE to run before build_one is ever called, or (b) change build_one to accept BUILD_ENV_FILE as a parameter and validate it at the top of build_one (error out if unset or non-readable) before the cp "$BUILD_ENV_FILE" "$bundle_dir/build.env" line; reference the build_one function and the BUILD_ENV_FILE variable when making the change so callers are updated accordingly.
docker/local-emulator/qemu/test-serial.sh (1)
104-109: Polling loop prints duplicate STACK_PROVISION lines on every iteration.
The current logic greps all STACK_PROVISION lines every 2 seconds, so the same lines are printed repeatedly until success or timeout. This clutters the output.
🔧 Track last printed line count to show only new lines
```diff
 elapsed=0
 timeout=120
+last_provision_lines=0
 while [ "$elapsed" -lt "$timeout" ]; do
   if grep -q "STACK_CLOUD_INIT_DONE" "$TMP_DIR/serial.log" 2>/dev/null; then
     echo ""
     echo "=== SUCCESS: STACK_CLOUD_INIT_DONE received ==="
     echo ""
     echo "=== All STACK_PROVISION lines ==="
     grep "STACK_PROVISION" "$TMP_DIR/serial.log" || echo "(none found)"
     exit 0
   fi
   # Show any STACK_PROVISION lines as they appear
-  if grep -q "STACK_PROVISION" "$TMP_DIR/serial.log" 2>/dev/null; then
-    grep "STACK_PROVISION" "$TMP_DIR/serial.log" | while IFS= read -r line; do
-      echo "  [${elapsed}s] $line"
-    done
-  fi
+  current_lines=$(grep -c "STACK_PROVISION" "$TMP_DIR/serial.log" 2>/dev/null || echo 0)
+  if [ "$current_lines" -gt "$last_provision_lines" ]; then
+    grep "STACK_PROVISION" "$TMP_DIR/serial.log" | tail -n +$((last_provision_lines + 1)) | while IFS= read -r line; do
+      echo "  [${elapsed}s] $line"
+    done
+    last_provision_lines=$current_lines
+  fi
   sleep 2
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/local-emulator/qemu/test-serial.sh` around lines 104 - 109, The polling loop prints duplicate STACK_PROVISION lines because each iteration greps the entire "$TMP_DIR/serial.log"; change it to print only new lines by tracking the last printed line count (e.g., last_provision_lines or last_index) and using tail (or grep with tail -n +<last+1>) against "$TMP_DIR/serial.log" instead of re-grepping the whole file; update the counter after printing (use the number of lines currently matching STACK_PROVISION via grep -c or wc -l) so subsequent iterations only echo new STACK_PROVISION entries and avoid duplicates while still showing elapsed with each new line.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docker/local-emulator/qemu/test-serial.sh`:
- Around line 65-74: The qemu invocation hardcodes accel="hvf" for both ARCH
branches causing failures on non-macOS hosts; update the logic around ARCH,
accel, and qemu_base to detect the host accelerator the same way as
build-image.sh or run-emulator.sh (reuse their detection function/logic) and
fall back to a portable option (e.g., tcg) when hvf is unavailable; ensure
find_aarch64_firmware is still used for arm64 but only set accel when detection
returns a supported backend and include that accel variable in the qemu_base
construction.
---
Nitpick comments:
In `@docker/local-emulator/qemu/build-image.sh`:
- Around line 277-280: The script copies BUILD_ENV_FILE inside build_one but
BUILD_ENV_FILE is only set later, so ensure build_one always has a valid
BUILD_ENV_FILE: either (a) move the generation of BUILD_ENV_FILE to run before
build_one is ever called, or (b) change build_one to accept BUILD_ENV_FILE as a
parameter and validate it at the top of build_one (error out if unset or
non-readable) before the cp "$BUILD_ENV_FILE" "$bundle_dir/build.env" line;
reference the build_one function and the BUILD_ENV_FILE variable when making the
change so callers are updated accordingly.
In `@docker/local-emulator/qemu/test-serial.sh`:
- Around line 104-109: The polling loop prints duplicate STACK_PROVISION lines
because each iteration greps the entire "$TMP_DIR/serial.log"; change it to
print only new lines by tracking the last printed line count (e.g.,
last_provision_lines or last_index) and using tail (or grep with tail -n
+<last+1>) against "$TMP_DIR/serial.log" instead of re-grepping the whole file;
update the counter after printing (use the number of lines currently matching
STACK_PROVISION via grep -c or wc -l) so subsequent iterations only echo new
STACK_PROVISION entries and avoid duplicates while still showing elapsed with
each new line.
📒 Files selected for processing (6)
.github/workflows/qemu-emulator-build.yaml
docker/local-emulator/Dockerfile
docker/local-emulator/qemu/build-image.sh
docker/local-emulator/qemu/cloud-init/emulator/user-data
docker/local-emulator/qemu/run-emulator.sh
docker/local-emulator/qemu/test-serial.sh
…pers

- wait-for-deps: table-drive wait probes via a SERVICES array (name|probe pairs)
- add /usr/local/bin/log-provision-stream for the repeated `cmd | while read line; do log prefix: $line; done` pattern (see the sketch below)
- factor write_marker_to_consoles in provision-build
- drop dead case sanitizer in build-image.sh:line_count
- tighten narrating comments in workflow and run-build-migrations
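A sketch of the first two helpers (probe commands illustrative; wait_for is the helper sketched earlier in this thread):

```bash
# log-provision-stream: prefix every stdin line through log-provision,
# replacing the repeated `cmd | while read line; do log ...; done` pattern.
log_provision_stream() {
  local prefix="$1" line
  while IFS= read -r line; do
    log-provision "$prefix: $line"
  done
}

# Table-driven probes: "name|probe" pairs consumed by wait_for.
SERVICES=(
  "postgres|pg_isready -h 127.0.0.1"   # probe commands illustrative
  "redis|redis-cli -h 127.0.0.1 ping"
)
for entry in "${SERVICES[@]}"; do
  name="${entry%%|*}"
  probe="${entry#*|}"
  wait_for "$name" sh -c "$probe"
done
```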
Actionable comments posted: 2
🧹 Nitpick comments (1)
.github/workflows/qemu-emulator-build.yaml (1)
83-103: Arm64 is no longer boot-validated in this workflow.
With the build-job runtime checks gated to amd64 and the test job only covering amd64, the packaged arm64 qcow2 never gets a post-build boot check. That makes first-boot regressions in the final image releasable. Please add a lightweight arm64 boot/readiness check instead of dropping validation entirely.
Also applies to: 121-130
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/qemu-emulator-build.yaml around lines 83 - 103, The workflow currently only starts and verifies the emulator when matrix.arch == 'amd64', leaving arm64 qcow2 unvalidated; update the steps "Start emulator and verify", "Verify services are healthy", and "Stop emulator" to also run for arm64 but in a lightweight smoke-check mode: change the if conditions to include matrix.arch == 'arm64' as well, and when matrix.arch == 'arm64' set a flag/env var (for example EMULATOR_SMOKE_CHECK=true or EMULATOR_MODE=smoke) and reduced checks/timeouts (e.g. EMULATOR_READY_TIMEOUT lower and fewer service probes) before invoking docker/local-emulator/qemu/run-emulator.sh with EMULATOR_ARCH=${{ matrix.arch }} so the arm64 image receives a minimal boot/readiness validation without full amd64-length testing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docker/local-emulator/qemu/cloud-init/emulator/user-data`:
- Around line 269-279: The docker run invocation for the deps container uses
--rm which removes the container immediately on exit, preventing later
inspection; remove the --rm flag from the docker run that creates the container
named stack-build-init and instead add explicit cleanup logic that removes (and
force-removes on failure if needed) the stack-build-init container at the end of
the surrounding script or in an EXIT trap; update the start sequence (the docker
run with name stack-build-init and env STACK_DEPS_ONLY) and add a corresponding
docker rm (or docker rm -f) cleanup step so logs and docker ps -a remain
available for diagnostics until the script explicitly cleans up.
- Around line 301-315: The migration docker exec invocation uses node flags
"--no-opt --no-wasm-tier-up" which still allows baseline JIT; update the sh -c
command that runs "node --no-opt --no-wasm-tier-up dist/db-migrations.mjs
migrate" and the subsequent seed invocation to use "--jitless --no-wasm-tier-up"
instead (replace both "--no-opt" occurrences) so db-migrations.mjs runs in
interpreter-only mode; keep the surrounding docker exec, env-file usage, and
redirection to migrate_log unchanged.
---
Nitpick comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Around line 83-103: The workflow currently only starts and verifies the
emulator when matrix.arch == 'amd64', leaving arm64 qcow2 unvalidated; update
the steps "Start emulator and verify", "Verify services are healthy", and "Stop
emulator" to also run for arm64 but in a lightweight smoke-check mode: change
the if conditions to include matrix.arch == 'arm64' as well, and when
matrix.arch == 'arm64' set a flag/env var (for example EMULATOR_SMOKE_CHECK=true
or EMULATOR_MODE=smoke) and reduced checks/timeouts (e.g. EMULATOR_READY_TIMEOUT
lower and fewer service probes) before invoking
docker/local-emulator/qemu/run-emulator.sh with EMULATOR_ARCH=${{ matrix.arch }}
so the arm64 image receives a minimal boot/readiness validation without full
amd64-length testing.
📒 Files selected for processing (3)
.github/workflows/qemu-emulator-build.yaml
docker/local-emulator/qemu/build-image.sh
docker/local-emulator/qemu/cloud-init/emulator/user-data
🚧 Files skipped from review as they are similar to previous changes (1)
- docker/local-emulator/qemu/build-image.sh
…ation exec

Migrations under cross-arch TCG were flaky with Runtime_WasmTriggerTierUp -> StackFrameIterator check failures (SIGTRAP/exit 133). --no-wasm-tier-up alone didn't suppress the dynamic tier-up decision runtime call, and Wasm code-GC during a stack walk can leave the inner-pointer-to-code cache stale.
2.6 GB to 1.3 GB final image